Pulsar Star Classification Problem

Pulsar stars are neutron stars that radiate two steady beams of light in opposite directions; because the star rotates, the light that reaches Earth appears to flicker. They are fascinating objects in themselves, but they also have plenty of applications: they have been used to test general relativity under extreme gravitational conditions, to build spatial maps, to create precise clocks, as probes of the interstellar medium and of spacetime, and to search for gravitational waves, among other things. So identifying possible pulsar stars can be of great help to astronomers and scientists for such applications. https://en.wikipedia.org/wiki/Pulsar https://www.space.com/32661-pulsars.html

The pulsar-candidate data comes from Kaggle and was collected during the High Time Resolution Universe Survey: https://www.kaggle.com/datasets/colearninglounge/predicting-pulsar-starintermediate It contains summary statistics for each object in the survey, with some preprocessing already applied for Kaggle. It is clearly labeled for a machine learning classification problem, which is what I will do in this notebook.

Exploratory Data Analysis and Data Cleaning.

First we load the data from the CSV file and rename the columns to make them easier to work with. The CSV file was downloaded from 'https://www.kaggle.com/datasets/colearninglounge/predicting-pulsar-starintermediate/download?datasetVersionNumber=1'
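As a sketch of this step, the snippet below parses a hypothetical two-row excerpt in place of the real Kaggle file, then strips whitespace and replaces spaces with underscores in the column names:

```python
import io

import pandas as pd

# Hypothetical two-feature excerpt; the real file has 8 features plus target_class
csv_text = """ Mean of the integrated profile, Excess kurtosis of the integrated profile,target_class
140.5,-0.23,0
99.3,0.47,1
"""
df = pd.read_csv(io.StringIO(csv_text))

# Strip stray whitespace and make the names attribute-friendly
df.columns = [c.strip().replace(" ", "_") for c in df.columns]
```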

Let's start the data analysis by looking at the histogram of each feature.

From the histograms we can see the features have very different scales, so we will probably need to scale or normalize them. The target class histogram shows that the data is imbalanced. Some of the histograms are heavily concentrated toward the left, which makes me suspect outliers, so I will check for them in scatter plots colored by the target class.

From the scatter plots we see there aren't really outliers, but the data is sparse at large values of some features.

Because the feature scales differ so much, I will apply min-max scaling so every feature lies between 0 and 1; methods like regression then receive nicely scaled data. This kind of scaling doesn't change the correlation between the features.

Now let's look at the information of the data, to check exactly how many instances we have and how many null values each feature has.

We see that the data has 8 features and a target indicating whether the object is a pulsar. The problem is a classification problem: use some or all of the features to predict whether the light that produced them comes from a pulsar.

The data has a total of 12528 instances, but some features have null values: Excess_kurtosis_of_the_integrated_profile has almost 14% of its rows null, Standard_deviation_of_the_DM_SNR_curve around 9%, and Skewness_of_the_DM_SNR_curve 5%.
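The per-feature percentage of missing rows can be computed as below; the frame here is a hypothetical four-row stand-in, not the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values to illustrate the check
df = pd.DataFrame({
    "Excess_kurtosis_of_the_integrated_profile": [0.1, np.nan, 0.3, np.nan],
    "Skewness_of_the_DM_SNR_curve": [1.0, 2.0, np.nan, 4.0],
})
null_pct = df.isnull().mean() * 100  # percent of missing rows per feature
```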

To decide what to do with the null values, we take a look at the correlation between all the features, including the target class indicating whether the data comes from a pulsar.

Correlation between the features.

The correlation matrix can tell us which features will be most important for the classification, as well as reveal correlations between features to be aware of.
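A sketch of the computation with `DataFrame.corr()`, using synthetic data in place of the real features (the target here is artificially driven by the kurtosis, so the correlation value is illustrative only):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
kurtosis = rng.normal(size=300)
# Hypothetical target loosely driven by the kurtosis, echoing the strong
# correlation found in the real data
target = (kurtosis + 0.3 * rng.normal(size=300) > 1.0).astype(int)
df = pd.DataFrame({"Excess_kurtosis_of_the_integrated_profile": kurtosis,
                   "target_class": target})
corr = df.corr()  # Pearson correlation matrix, including the target column
```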

From the correlation plot we learn that the feature most correlated with the target class is Excess kurtosis of the integrated profile, at 0.79; it is also the feature with the most missing values, so we need to be careful about how we handle them. To get a bit more information about how the target class relates to the rest of the features, and how they relate to each other, we look at the pair plot.

The pair plot of each feature with the color of each data point representing the target.

In the pair plot (orange = 1 means it's a pulsar, blue = 0 means it's not), the classes look easier to separate in the plots involving the integrated profile than in those involving the DM SNR curve. So I expect the resulting model to depend more on the first 4 features.

By the usual rule of thumb on missing values, we should drop Excess_kurtosis_of_the_integrated_profile, since more than 10% of its values are missing. However, because it is the feature with the highest correlation with the target class, I will instead try to impute the missing values by fitting that feature with a multilinear regression on the rest of the integrated-profile features.

From both the correlation matrix and the pair plot we see that the excess kurtosis of the integrated profile is highly correlated with the skewness and the mean of the integrated profile. Looking at the plot of skewness vs. excess kurtosis, the shape is close to a square root. Let's look at that plot on its own, with the color of the dots representing the standard deviation of the integrated profile, to see if it adds further information.

In the plot above the color is the standard deviation of the integrated profile, and we can see it does add information. The green line that envelops the data is (0.7*Skewness)^0.55 + 0.2, which I arrived at by eye through trial and error. So if we combine Skewness^0.55 with the standard deviation and the mean in a linear regression model, we might get a good approximation of the excess kurtosis to substitute for the missing values.

Linear Regression to get the missing Excess Kurtosis of the integrated profile.

Using the integrated-profile data and multilinear regression, I fit a model to estimate the missing Excess Kurtosis values.
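A minimal sketch of the imputation model with scikit-learn's LinearRegression (the notebook's summary with adjusted R^2 and p-values suggests statsmodels; sklearn is used here as a stand-in). The data below is synthetic, on the [0, 1] min-max scale, and the generating relation is an assumption that echoes the envelope curve found above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic integrated-profile features on the [0, 1] min-max scale;
# the generating relation below is an assumption, not the real data
rng = np.random.default_rng(42)
n = 500
skewness = rng.uniform(0, 1, n)
mean_profile = rng.uniform(0, 1, n)
std_profile = rng.uniform(0, 1, n)
kurtosis = (0.7 * skewness**0.55 + 0.1 * mean_profile
            - 0.05 * std_profile + rng.normal(0, 0.02, n))

# Regress kurtosis on Skewness^0.55, the mean, and the standard deviation
X = np.column_stack([skewness**0.55, mean_profile, std_profile])
reg = LinearRegression().fit(X, kurtosis)
r2 = reg.score(X, kurtosis)  # in-sample R^2 of the imputation model
```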

As we can see from the model summary, the adjusted R^2 is quite high at 0.971, indicating a good fit, and all the p-values for the coefficients are below 0.001. To get the adjusted R^2 on the test data I use the function I wrote for it in module 3.

We got a pretty good R^2 value for the test data too. Plotting the predictions against the actual Excess Kurtosis values, colored by target class (blue dots are class 0, red dots are class 1), a perfect prediction would lie on a line of slope 1, like the green line on the plot.

We can see that from 0.2 to 1 the prediction is pretty good, but it breaks down for values between 0 and 0.2, as expected. However, for small excess kurtosis values the error doesn't seem to change the predicted class, so I will impute the missing values with the model's approximation.

From the correlation matrix and the pair plot, the other two features with missing values don't seem that important for predicting the target class, so we will drop the Standard_deviation_of_the_DM_SNR_curve column and the rows where Skewness_of_the_DM_SNR_curve (only 5% missing) is null.
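The drop-column and drop-rows step might look like this; the three-row frame is a stand-in for the real data:

```python
import numpy as np
import pandas as pd

# Three-row stand-in for the real frame
df = pd.DataFrame({
    "Standard_deviation_of_the_DM_SNR_curve": [1.0, np.nan, 3.0],
    "Skewness_of_the_DM_SNR_curve": [0.5, 0.6, np.nan],
    "target_class": [0, 1, 0],
})
# Drop the whole column for one feature, and only the null rows for the other
df = df.drop(columns=["Standard_deviation_of_the_DM_SNR_curve"])
df = df.dropna(subset=["Skewness_of_the_DM_SNR_curve"])
```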

As the final step of the exploratory data analysis and cleanup, let's see how imbalanced the target class is by computing its mean value. A perfectly balanced target class would have a mean of 0.5.

This means the data is quite imbalanced, as more than 90% of it is not a pulsar.

Machine Learning Classification Problem.

The data is now ready for the classification problem of determining whether a star is a pulsar. I will try different models and see which one works best. The first thing to do is split the data into train and test sets.
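A sketch of the split with a stratified hold-out, so the roughly 10% positive rate is preserved in both sets; the data here is synthetic:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 8 scaled features, ~10% positives like the pulsar data
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 8))
y = np.array([1] * 20 + [0] * 180)

# stratify=y keeps the class ratio identical in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
```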

Then we use the functions from module 4 to evaluate precision, recall, and F1. Because the data is imbalanced, we won't rely on accuracy alone: a model that always predicts 0 would have 0.91 accuracy. Instead, I will look for the best F1 when tuning each model's parameters or hyperparameters. Since F1 combines precision and recall, it tells us whether we are missing too many positives or overpredicting them. We want some balance, as we neither want to flag too many non-pulsars as pulsars nor miss too many good candidates.
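To make the accuracy trap concrete, here is a toy example with scikit-learn's metric functions (the labels are made up):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy labels: 10 objects, 2 true pulsars (hypothetical values)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]  # one false positive, one false negative

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
baseline_acc = accuracy_score(y_true, [0] * 10)  # always-negative baseline
```

Note how the always-negative baseline still scores 0.8 accuracy while its recall and F1 are zero, which is why F1 drives the model selection here.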

I will fit 7 different classifiers to the data and compare their evaluation metrics. The classifiers will be:

  1. Logistic Regression
  2. K-Nearest Neighbors
  3. Decision Tree
  4. Random Forest
  5. Extra Tree
  6. AdaBoost
  7. Linear Support Vector Machine.

The evaluation metrics to compare them will be precision, recall, F1, the ROC curve with its AUC, and the confusion matrix.

For each classifier I will use the corresponding sklearn class, fit it on the train data, and run a parameter or hyperparameter search to get the "best" classifier. I will then use that classifier to predict the test data and, when the model allows it, also get the predicted probabilities on the test data in order to compute the ROC curve. I will also show the confusion matrix of each classifier to see how many data points were misclassified, and how, for each model.

Logistic Regression.

We start by fitting a simple logistic regression model on the train data, setting the class weight to balanced because of the imbalance in our data.
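A minimal sketch, using a synthetic imbalanced dataset in place of the pulsar features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced stand-in: 8 features, ~10% positives
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           weights=[0.9, 0.1], random_state=0)

# class_weight="balanced" reweights errors inversely to class frequency
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X, y)

coef_magnitudes = np.abs(clf.coef_[0])  # largest magnitude ~ most influential feature
train_acc = clf.score(X, y)
```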

As expected, the most important feature in the logistic regression, i.e. the one with the largest coefficient, was the Excess Kurtosis of the integrated profile.

To evaluate how well the model did we calculate the precision, the recall, the ROC curve and the confusion matrix.

The ROC curve shows the model works pretty well. However, it seems we are overpredicting positives. Let's see if further models get us a better F1.

KNN model

Given that the number of features is small and we already scaled them all, we can try a KNN model to see how it performs against the rest. We will search over the number of neighbors to find the best KNN classifier.
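A sketch of the neighbor search with GridSearchCV scored by F1; the data is synthetic, and the grid of odd k values is an assumption (odd k avoids ties in the majority vote):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced stand-in for the pulsar features
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           weights=[0.9, 0.1], random_state=0)

# Cross-validated search over the number of neighbors, scored by F1
search = GridSearchCV(KNeighborsClassifier(),
                      {"n_neighbors": list(range(1, 16, 2))},
                      scoring="f1", cv=5)
search.fit(X, y)
best_k = search.best_params_["n_neighbors"]
```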

The KNN model had better precision than the logistic regression, as it gives few false positives. However, it has more false negatives, meaning we miss more true pulsar candidates.

Decision Tree

Next, a simple decision tree, pruned for different alphas like we did in Module 4.

We prune over a range of alphas and evaluate each pruned tree.

We choose the alpha where the test F1 is largest.
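The pruning-and-selection loop can be sketched with scikit-learn's cost-complexity pruning path; the dataset and the hold-out used to score each alpha are synthetic stand-ins:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced stand-in for the pulsar features
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Candidate alphas come from the cost-complexity pruning path of the full tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
scores = []
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_tr, y_tr)
    scores.append(f1_score(y_te, tree.predict(X_te), zero_division=0))
best_alpha = path.ccp_alphas[int(np.argmax(scores))]  # alpha with the best test F1
```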

With this very deep decision tree we improved precision and F1 over the logistic regression, but recall and AUC decreased. In other words, false positives went down but false negatives went up, so with the decision tree we miss out on pulsar candidates.

Even with pruning, the decision tree had to be quite big to beat the logistic regression's F1: its depth is 18. Such a deep tree usually overfits, and we lose the simplicity of a shallow decision tree. A tree ensemble method with small trees could give better results.

Ensemble trees.

Random Forest Classifier

A Random Forest classifier like the one we saw in module 5, with trees of max depth 2, each trained on only 5% of the sample.
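A sketch with scikit-learn's RandomForestClassifier, where `max_depth=2` and `max_samples=0.05` mirror the settings described above; the data is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced stand-in for the pulsar features
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           weights=[0.9, 0.1], random_state=0)

# Many shallow trees, each fit on a 5% bootstrap sample of the rows
rf = RandomForestClassifier(n_estimators=200, max_depth=2, max_samples=0.05,
                            random_state=0)
rf.fit(X, y)
```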

Extra Tree Classifier

An Extra Trees classifier, or Extremely Randomized Trees, is similar to a random forest with two differences: the basic extra-trees algorithm uses all the training data for each decision tree, while the random forest bootstraps samples; the second and more important difference is that extra trees choose the split points randomly, while the random forest chooses the optimal split (https://quantdare.com/what-is-the-difference-between-extra-trees-and-random-forest/). Because of this, the extra-trees algorithm is faster than the random forest. Let's try it out, and maybe even use more samples to build our classifier since it's faster. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html?highlight=ensemble+extra+tree+classifier
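A minimal sketch with scikit-learn's ExtraTreesClassifier on synthetic data; like the other ensembles, it exposes `predict_proba`, which is what feeds the ROC curve:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic imbalanced stand-in for the pulsar features
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           weights=[0.9, 0.1], random_state=0)

# Splits are drawn at random, so fitting is cheaper than a random forest
et = ExtraTreesClassifier(n_estimators=200, random_state=0)
et.fit(X, y)
proba = et.predict_proba(X)  # class probabilities, usable for the ROC curve
```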

AdaBoost Classifier

Let's try a boosted ensemble method like the AdaBoost we saw in module 5.
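A sketch with scikit-learn's AdaBoostClassifier (with its default decision-stump base learners) on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic imbalanced stand-in for the pulsar features
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           weights=[0.9, 0.1], random_state=0)

# Sequentially fitted weak learners, each focusing on the previous ones' errors
ada = AdaBoostClassifier(n_estimators=100, random_state=0)
ada.fit(X, y)
```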

Support Vector Machine.

From the pair plot it looked like there could be a linear boundary, so let's try a linear Support Vector Machine and grid-search for the best parameter C.
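A sketch of the C search with LinearSVC; the data and the C grid are assumptions. Note that LinearSVC offers `decision_function` margins rather than probabilities:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic imbalanced stand-in for the pulsar features
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           weights=[0.9, 0.1], random_state=0)

# Grid search over the regularization parameter C, scored by F1
search = GridSearchCV(LinearSVC(class_weight="balanced", max_iter=10000),
                      {"C": [0.01, 0.1, 1, 10]}, scoring="f1", cv=5)
search.fit(X, y)
best_C = search.best_params_["C"]

# LinearSVC has no predict_proba; decision_function gives signed margins instead
margins = search.best_estimator_.decision_function(X)
```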

A linear support vector machine doesn't give the probability of the target class, since its decisions are made by boundaries rather than probabilities; because of this there is no ROC curve or AUC value for this classifier.

Results

This is the final table comparing the evaluation metrics of each classifier.

First, look at the AUC values for all classifiers except the linear SVM.

The Extra Trees classifier achieved the best AUC value; however, the difference with all models but the single decision tree is negligible, as they are all around 0.95.

We can visualize the rest of the evaluation metrics in a parallel plot where the color of each line indicates the classifier and each vertical axis corresponds to an evaluation metric.

The choice of classifier depends on what matters most when predicting pulsar candidates. If the goal is to miss as few pulsar candidates as possible, then the Logistic Regression model is the best bet, as it has the highest recall of all the classifiers. However, if that were the metric we were after, we would have to repeat the parameter and hyperparameter search for all the models using recall as the scoring. Here I focused on the best balance of results, meaning the highest F1, so the model wouldn't miss too many pulsar candidates but also wouldn't produce too many false positives, which would waste resources looking for pulsars where none exist. In the F1 race the biggest loser was the Logistic Regression classifier with a value of ~1.6; the Decision Tree and the Random Forest are around 1.7, and the other 4 models have very close F1 values around 1.8.

If what we care about most is precision, KNN and the Linear Support Vector machine got the best results, with values around 0.96, but the ensemble methods Extra Trees and AdaBoost also had precision above 0.9.

Conclusions

All the classifiers, based on different models, did better than the trivial all-negative or random 1-in-10 baseline, and depending on what we want we could choose a different model. I would say all 7 models worked pretty well.

The main decisions behind these results were in the data imputation. I would like to see what would have happened if, instead of estimating the missing excess kurtosis values from the other features, I had dropped the column, or deleted the instances with missing values. However, I chose imputation because the test data could also be missing the excess kurtosis, and from the correlation matrix it looked like a really important feature for predicting the target class.